The control flow of a program determines the order in which lines of code are executed. All else being equal, Python code is executed linearly, in the order that lines appear in the program. However, all is not usually equal, and so the appropriate control flow is frequently specified with the help of control flow statements. These include loops, conditional statements and calls to functions. Let’s look at a few of these here.
One way to repeatedly execute a block of statements (i.e. loop) is to use a for statement. These statements iterate over the elements of a specified sequence, according to the following syntax:
In [ ]:
for letter in 'ciao':
    print('give me a {0}'.format(letter.upper()))
Recall that strings are simply regarded as sequences of characters. Hence, the above for statement loops over each letter, converting each to upper case with the upper() method and printing it.
Similarly, as shown in the introduction, list comprehensions may be constructed using for statements:
In [ ]:
[i**2 for i in range(10)]
Here, the expression loops over range(10) -- the sequence from 0 to 9 -- and squares each before placing it in the returned list.
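The comprehension above can be unrolled into an equivalent explicit loop; here is a quick sketch:

```python
# Equivalent explicit loop for the comprehension above
squares = []
for i in range(10):
    squares.append(i**2)
squares
```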
As the name implies, if statements execute particular sections of code depending on some tested condition. For example, to code an absolute value function, one might employ conditional statements:
In [ ]:
def absval(some_list):
    # Create empty list
    absolutes = []
    # Loop over elements in some_list
    for value in some_list:
        # Conditional statement
        if value < 0:
            # Negative value
            absolutes.append(-value)
        else:
            # Positive value
            absolutes.append(value)
    return absolutes
Here, each value in some_list is tested for the condition that it is negative, in which case it is multiplied by -1, otherwise it is appended as-is.
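To check the behavior, we can call the function on a small mixed list (absval is repeated here so the cell stands alone):

```python
def absval(some_list):
    # Append the negated value if negative, else the value itself
    absolutes = []
    for value in some_list:
        if value < 0:
            absolutes.append(-value)
        else:
            absolutes.append(value)
    return absolutes

absval([-3, 5, -1, 0])  # [3, 5, 1, 0]
```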
For conditions that have more than two possible values, the elif clause can be used:
In [ ]:
x = 5
if x < 0:
    print('x is negative')
elif x % 2:
    print('x is positive and odd')
else:
    print('x is even and non-negative')
A different type of conditional loop is provided by the while statement. Rather than iterating a specified number of times, according to a given sequence, while executes its block of code repeatedly, until its condition is no longer true.
For example, suppose we want to sample from a truncated normal distribution, where we are only interested in positive-valued samples. The following function is one solution:
In [ ]:
# Import function
from numpy.random import normal

def truncated_normals(how_many, l):
    # Create empty list
    values = []
    # Loop until we have specified number of samples
    while len(values) < how_many:
        # Sample from standard normal
        x = normal(0, 1)
        # Append if not truncated
        if x > l:
            values.append(x)
    return values
In [ ]:
truncated_normals(15, 0)
This function iteratively samples from a standard normal distribution, appending each sample to the output list if it exceeds the truncation point, and stops to return the list once the specified number of values has been collected.
Obviously, the body of the while statement should contain code that eventually renders the condition false, otherwise the loop will never end! An exception to this is if the body of the statement contains a break or return statement; in either case, the loop will be interrupted.
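For example, a break statement can terminate a deliberately infinite loop once some condition is met; a minimal sketch with made-up values:

```python
# Accumulate 1 + 2 + 3 + ... until the running total exceeds 100
total = 0
n = 0
while True:
    n += 1
    total += n
    if total > 100:
        # Exit the otherwise infinite loop
        break
total, n
```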
When a Python function is called, it creates a namespace for the function, executes the code that comprises the function (creating objects inside the namespace as required), and returns some result to its caller. After the return, everything inside the namespace (including the namespace itself) is gone, and is created anew when the function is called again.
However, one particular class of functions in Python breaks this pattern, returning a value to the caller while still active, and able to return subsequent values as needed. Python generators employ yield statements in place of return, allowing a sequence of values to be generated without having to create a new function namespace each time. In other languages, this construct is known as a coroutine.
For example, we may want to have a function that returns a sequence of values; let's consider, for a simple illustration, the Fibonacci sequence:
$$F_i = F_{i-2} + F_{i-1}$$
It's certainly possible to write a standard Python function that returns however many Fibonacci numbers we need:
In [ ]:
import numpy as np
def fibonacci(size):
    F = np.empty(size, 'int')
    a, b = 0, 1
    for i in range(size):
        F[i] = a
        a, b = b, a + b
    return F
and this works just fine:
In [ ]:
fibonacci(20)
However, what if we need one number at a time, or if we need a million or 10 million values? In the first case, you would somehow have to store the values from the last iteration, and restore the state to the function each time it is called. In the second case, you would have to generate and then store a very large number of values, most of which you may not need right now.
A more sensible solution is to create a generator, which calculates a single value in the sequence, then returns control back to the caller. This allows the generator to be called again, resuming the sequence generation where it left off. Here's the Fibonacci function, implemented as a generator:
In [ ]:
def gfibonacci(size):
    a, b = 0, 1
    for _ in range(size):
        yield a
        a, b = b, a + b
Notice that there is no return statement at all; just yield, which is where a value is returned each time one is requested. The yield statement is what defines a generator.
When we call our generator, rather than a sequence of Fibonacci numbers, we get a generator object:
In [ ]:
f = gfibonacci(100)
f
A generator has a __next__() method that can be called via the builtin function next(). The call to next executes the generator until the yield statement is reached, returning the next generated value, and then pausing until another call to next occurs:
In [ ]:
next(f), next(f), next(f)
A generator is a type of iterator. If we call a function that supports iterables using a generator as an argument, it will know how to use the generator.
In [ ]:
np.array(list(f))
What happens when we reach the "end" of a generator?
In [ ]:
a_few_fibs = gfibonacci(2)
In [ ]:
next(a_few_fibs)
In [ ]:
next(a_few_fibs)
In [ ]:
next(a_few_fibs)
Thus, generators signal that there are no further values to generate by raising a StopIteration exception. We must either handle this exception, or create a generator that is infinite, which we can do in this example by replacing the for loop with a while loop:
In [ ]:
def infinite_fib():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b
In [ ]:
f = infinite_fib()
vals = [next(f) for _ in range(10000)]
vals[-1]
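Note that a for loop (or a comprehension over a finite generator) handles StopIteration for us behind the scenes; a quick sketch using the finite gfibonacci from above:

```python
def gfibonacci(size):
    a, b = 0, 1
    for _ in range(size):
        yield a
        a, b = b, a + b

# The comprehension consumes the generator until StopIteration,
# which it catches internally rather than propagating
fibs = [x for x in gfibonacci(5)]
fibs  # [0, 1, 1, 2, 3]
```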
Inevitably, some code you write will generate errors, at least in some situations. Unless you explicitly anticipate and handle these errors, they will cause your code to halt (sometimes this is a good thing!). Errors are handled using try/except blocks.
If code executed in the try block generates an error, execution moves to the except block. If the exception that is specified corresponds to the one that has been raised, the code in the except block is executed before continuing; otherwise, the exception propagates and the code is halted.
In [ ]:
absval(-5)
In the call to absval, we passed a single negative integer, whereas the function expects some sort of iterable data structure. Other than changing the function itself, we can avoid this error using exception handling.
In [ ]:
x = -5
try:
    print(absval(x))
except TypeError:
    print('The argument to absval must be iterable!')
In [ ]:
x = -5
try:
    print(absval(x))
except TypeError:
    print(absval([x]))
We can raise exceptions manually using the raise statement.
In [ ]:
raise ValueError('This is the wrong value')
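A manually raised exception can be caught like any other; a short sketch, using a hypothetical check_positive helper:

```python
def check_positive(x):
    # Raise manually when the input is invalid
    if x <= 0:
        raise ValueError('x must be positive')
    return x

try:
    check_positive(-1)
except ValueError as err:
    message = str(err)

message  # 'x must be positive'
```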
Python includes operations for importing and exporting data from files and binary objects, and third-party packages exist for database connectivity. The easiest way to import data is to parse a delimited text file, which can usually be exported from spreadsheets and databases. Data may be read from and written to regular files by creating file objects with the built-in open function:
In [ ]:
microbiome = open('../data/microbiome.csv')
Here, a file containing microbiome data in a comma-delimited format is opened, and assigned to an object called microbiome. The next step is to transfer the information in the file to a usable data structure in Python. Since this dataset contains five variables: the name of the taxon, the patient identifier (de-identified), the treatment group, the bacteria count in tissue, and the bacteria count in stool, it is convenient to use a dictionary. This allows each variable to be specified by name.
First, a dictionary object is initialized, with appropriate keys and corresponding lists, initially empty. Since the file has a header, we can use it to generate an empty dict:
In [ ]:
column_names = next(microbiome).rstrip('\n').split(',')
column_names
Compatibility Corner: In Python 2, the file object returned by open had a next method that could be called directly. In Python 3, the file object is an iterator, which requires the use of the built-in function next.
In [ ]:
mb_dict = {name:[] for name in column_names}
In [ ]:
mb_dict
In [ ]:
for line in microbiome:
    taxon, patient, group, tissue, stool = line.rstrip('\n').split(',')
    mb_dict['Taxon'].append(taxon)
    mb_dict['Patient'].append(int(patient))
    mb_dict['Group'].append(int(group))
    mb_dict['Tissue'].append(int(tissue))
    mb_dict['Stool'].append(int(stool))
For each line in the file, data elements are split by the comma delimiter, using the split method that is built-in to string objects. Each datum is subsequently appended to the appropriate list stored in the dictionary. After all the data is parsed, it is polite to close the file:
In [ ]:
microbiome.close()
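A more robust idiom is the with statement, which closes the file automatically even if an error occurs mid-read. A minimal sketch, using an in-memory io.StringIO stand-in for the CSV file so the cell is self-contained:

```python
import io

# Hypothetical in-memory stand-in for an open('...csv') file object
fake_file = io.StringIO('Taxon,Patient\nFirmicutes,1\n')

# The with statement closes the file automatically on exit
with fake_file as f:
    header = next(f).rstrip('\n').split(',')

header  # ['Taxon', 'Patient']
```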
The data can now be readily accessed by indexing the appropriate variable by name:
In [ ]:
mb_dict['Tissue'][:10]
A second approach to importing data involves interfacing directly with a relational database management system. Relational databases are far more efficient for storing, maintaining and querying data than plain text files or spreadsheets, particularly for large datasets or multiple tables. A number of third parties have created packages for database access in Python. For example, sqlite3 is a package that provides connectivity for SQLite databases:
In [ ]:
import sqlite3
db = sqlite3.connect(database='../data/baseball-archive-2011.sqlite')
# create a cursor object to communicate with database
cur = db.cursor()
In [ ]:
# run query
cur.execute('SELECT playerid, HR, SB FROM Batting WHERE yearID=1970')
# fetch data, and assign to variable
baseball = cur.fetchall()
baseball[:10]
In [ ]:
# Function for calculating the mean of some data
def mean(data):
    # Initialize sum to zero
    sum_x = 0.0
    # Loop over data
    for x in data:
        # Add to sum
        sum_x += x
    # Divide by number of elements in list, and return
    return sum_x / len(data)
As we can see, arguments are specified in parentheses following the function name. If there are sensible "default" values, they can be specified as a keyword argument.
In [ ]:
def var(data, sample=True):
    # Get mean of data from function above
    x_bar = mean(data)
    # Do sum of squares in one line
    sum_squares = sum([(x - x_bar)**2 for x in data])
    # Divide by n-1 and return
    if sample:
        return sum_squares/(len(data)-1)
    return sum_squares/len(data)
Non-keyword arguments must always precede keyword arguments, and must always be presented in order; order is not important for keyword arguments.
Arguments can also be passed to functions as a tuple/list/dict using the asterisk notation.
In [ ]:
def some_computation(a=-1, b=4.3, c=7):
    return (a + b) / float(c)

args = (5, 4, 3)
some_computation(*args)
In [ ]:
kwargs = {'b':4, 'a':5, 'c':3}
some_computation(**kwargs)
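The asterisk notation also works in the reverse direction: a function can collect arbitrary positional arguments into a tuple and arbitrary keyword arguments into a dict. A quick sketch with a hypothetical summarize function:

```python
def summarize(*args, **kwargs):
    # args arrives as a tuple, kwargs as a dict
    return len(args), sorted(kwargs)

summarize(1, 2, 3, b=4, a=5)  # (3, ['a', 'b'])
```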
A lambda expression creates an anonymous one-line function that can simply be assigned to a name.
In [ ]:
import numpy as np
normalize = lambda data: (np.array(data) - np.mean(data)) / np.std(data)
or not:
In [ ]:
(lambda data: (np.array(data) - np.mean(data)) / np.std(data))([5,8,3,8,3,1,2,1])
Python has several built-in, higher-order functions that are useful.
In [ ]:
list(filter(lambda x: x > 5, range(10)))
In [ ]:
abs([5,-6])
In [ ]:
list(map(abs, [5, -6]))
Let's try coding a statistical function. Suppose we want to estimate the parameters of a simple linear regression model. The objective of regression analysis is to specify an equation that will predict some response variable $Y$ based on a set of predictor variables $X$. This is done by fitting parameter values $\beta$ of a regression model using extant data for $X$ and $Y$. This equation has the form:
$$Y = X\beta + \epsilon$$where $\epsilon$ is a vector of errors. One way to fit this model is using the method of least squares, which is given by:
$$\hat{\beta} = (X^{\prime} X)^{-1}X^{\prime} Y$$We can write a function that calculates this estimate, with the help of some functions from other modules:
In [ ]:
from numpy.linalg import inv
from numpy import transpose, array, dot
We will call this function solve, requiring the predictor and response variables as arguments. For simplicity, we will restrict the function to univariate regression, whereby only a single slope and intercept are estimated:
In [ ]:
def solve(x, y):
    '''Estimates regression coefficients from data'''
    '''
    The first step is to specify the design matrix. For this,
    we need to create a vector of ones (corresponding to the intercept term)
    and, along with x, create a 2 x n array:
    '''
    X = array([[1]*len(x), x])
    '''
    An array is a data structure from the numpy package, similar to a list,
    but allowing for multiple dimensions. Next, we calculate the transpose of X,
    using another numpy function, transpose:
    '''
    Xt = transpose(X)
    '''
    Finally, we use the matrix multiplication function dot, also from numpy,
    to calculate the dot products. The inverse function is provided by numpy.linalg.
    Provided that X'X is not singular (which would raise an exception), this
    yields estimates of the intercept and slope, as an array:
    '''
    b_hat = dot(inv(dot(X, Xt)), dot(X, y))
    return b_hat
Here is solve in action:
In [ ]:
solve((10,5,10,11,14),(-4,3,0,23,0.6))
As previously stated, Python is an object-oriented programming (OOP) language, in contrast to procedural languages like FORTRAN and C. As the name implies, object-oriented languages employ objects to create convenient abstractions of data structures. This allows for more flexible programs, fewer lines of code, and a more natural programming paradigm in general. An object is simply a modular unit of data and associated functions, related to the state and behavior, respectively, of some abstract entity. Object-oriented languages group similar objects into classes. For example, consider a Python class representing a bird:
In [ ]:
class Bird:
    # Class representing a bird
    name = 'bird'

    def __init__(self, sex):
        # Initialization method
        self.sex = sex

    def fly(self):
        # Makes bird fly
        print('Flying!')

    def nest(self):
        # Makes bird build nest
        print('Building nest ...')

    @classmethod
    def get_name(cls):
        # Class methods are shared among instances
        return cls.name
You will notice that this Bird class is simply a container for functions (called methods in Python), such as fly and nest, along with one class attribute, name. The methods represent behavior common to all members of this class. You can run this code in Python, and create birds:
In [ ]:
Tweety = Bird('male')
Tweety.name
In [ ]:
Tweety.fly()
In [ ]:
Foghorn = Bird('male')
Foghorn.nest()
A classmethod can be called without instantiating an object.
In [ ]:
Bird.get_name()
Whereas standard methods cannot:
In [ ]:
Bird.fly()
As many instances of the bird class can be generated as desired, though it may quickly become boring. One of the important benefits of using object-oriented classes is code re-use. For example, we may want more specific kinds of birds, with unique functionality:
In [ ]:
class Duck(Bird):
    # Duck is a subclass of bird
    name = 'duck'

    def swim(self):
        # Ducks can swim
        print('Swimming!')

    def quack(self, n):
        # Ducks can quack
        print('Quack! ' * n)
Notice that this new duck class refers to the bird class in parentheses after the class declaration; this is called inheritance. The subclass duck automatically inherits all of the variables and methods of the superclass, but allows new functions or variables to be added. In addition to flying and nest-building, our duck can also swim and quack:
In [ ]:
Daffy = Duck('male')
Daffy.swim()
In [ ]:
Daffy.quack(3)
In [ ]:
Daffy.nest()
Along with adding new variables and methods, a subclass can also override existing variables and methods of the superclass. For example, one might define fly in the duck subclass to return an entirely different string. It is easy to see how inheritance promotes code re-use, sometimes dramatically reducing development time. Classes which are very similar need not be coded repetitiously, but rather, just extended from a single superclass.
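As a sketch of overriding (with fly returning a string rather than printing, as suggested above; these minimal class bodies are illustrative, not the full classes from earlier):

```python
class Bird:
    def fly(self):
        return 'Flying!'

class Duck(Bird):
    # Override the superclass fly method entirely
    def fly(self):
        return 'Flapping furiously!'

Duck().fly()  # 'Flapping furiously!'
```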
This brief introduction to object-oriented programming is intended only to introduce new users of Python to this programming paradigm. There are many more salient object-oriented topics, including interfaces, composition, and introspection. I encourage interested readers to refer to any number of current Python and OOP books for a more comprehensive treatment. Note, however, that everything in Python is an object; even an integer literal has attributes and methods:
In [ ]:
dir(1)
In [ ]:
(1).bit_length()
This has implications for how assignment works in Python.
Let's create a trivial class:
In [ ]:
class Thing: pass
and instantiate it:
In [ ]:
x = Thing()
x
Here, x is simply a "label" for the object that we created when calling Thing. That object resides at the memory location that is identified when we print x. Notice that if we create another Thing, we create a new object, and give it a label. We know it is a new object because it has its own memory location.
In [ ]:
y = Thing()
y
What if we assign x to z?
In [ ]:
z = x
z
We see that the object labeled z is the same object as the one labeled x. So, we say that z is a label (or name) with a binding to the object created by Thing.
So, there are no "variables", in the sense of a container for values, in Python. There are only labels and bindings.
In [ ]:
x.name = 'thing x'
In [ ]:
z.name
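The shared binding can be confirmed with the is operator, which tests object identity; a quick sketch:

```python
class Thing: pass

x = Thing()
z = x

# Both names are bound to the same object
x is z  # True
```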
This can get you into trouble. Consider the following (seemingly innocuous) way of creating a dictionary of empty lists:
In [ ]:
evil_dict = dict.fromkeys(column_names, [])
evil_dict
In [ ]:
evil_dict['Tissue'].append(5)
In [ ]:
evil_dict
Why did this happen?
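The culprit is the label-and-binding behavior just described: dict.fromkeys evaluates the empty list once and binds every key to that same single object. A dict comprehension, by contrast, evaluates [] once per key; a quick sketch:

```python
keys = ['a', 'b']

# Every key is bound to one shared list object
shared = dict.fromkeys(keys, [])
shared['a'].append(5)

# The comprehension creates a distinct list for each key
separate = {k: [] for k in keys}
separate['a'].append(5)

shared['b'], separate['b']  # ([5], [])
```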